Dynamic Scheduling of MapReduce Shuffle under Bandwidth Constraints
Authors
Abstract
Whether it is for e-science or business, the amount of data produced every year is growing at a high rate. Managing and processing those data raises new challenges. MapReduce is one answer to the need for scalable tools able to handle such volumes of data. It imposes a general structure of computation and lets the implementation perform its optimizations. During the computation, there is a phase called Shuffle in which every node sends a possibly large amount of data to every other node. This report proposes and evaluates six algorithms to improve data transfers during the Shuffle phase under bandwidth constraints.
Keywords: Big Data, MapReduce, shuffle, scheduling, network, contention, bandwidth, regulation
Related resources
ShuffleWatcher: Shuffle-aware Scheduling in Multi-tenant MapReduce Clusters
MapReduce clusters are usually multi-tenant (i.e., shared among multiple users and jobs) for improving cost and utilization. The performance of jobs in a multi-tenant MapReduce cluster is greatly impacted by the all-Map-to-all-Reduce communication, or Shuffle, which saturates the cluster’s hard-to-scale network bisection bandwidth. Previous schedulers optimize Map input locality but do not consid...
MapReduce with communication overlap (MaRCO)
MapReduce is a programming model from Google for cluster-based computing in domains such as search engines, machine learning, and data mining. MapReduce provides automatic data management and fault tolerance to improve programmability of clusters. MapReduce’s execution model includes an all-map-to-all-reduce communication, called the shuffle, across the network bisection. Some MapReductions mov...
Scheduling MapReduce Jobs and Data Shuffle on Unrelated Processors
We propose constant approximation algorithms for generalizations of the Flexible Flow Shop (FFS) problem which form a realistic model for non-preemptive scheduling in MapReduce systems. Our results concern the minimization of the total weighted completion time of a set of MapReduce jobs on unrelated processors and improve substantially on the model proposed by Moseley et al. (SPAA 2011) in two ...
On Scheduling Algorithms for MapReduce Jobs in Heterogeneous Clouds with Budget Constraints
In this paper, we consider task-level scheduling algorithms with respect to budget constraints for a bag of MapReduce jobs on a set of provisioned heterogeneous (virtual) machines in cloud platforms. The heterogeneity is manifested in the popular "pay-as-you-go" charging model, where service machines with different performance have different service rates. We organize a bag of jobs as ...
Dynamic Cargo Trains Scheduling for Tackling Network Constraints and Costs Emanating from Tardiness and Earliness
This paper aims to develop a multi-objective model for scheduling cargo trains faced by the costs of tardiness and earliness, time limitations, queue priority and limited station lines. Based upon the Islamic Republic of Iran Railway Corporation (IRIRC) regulations, passenger trains enjoy priority over other trains for departure. Therefore, the timetable of cargo trains must be determined based...